Skip to content

<fix>[ha]: setup self-fencer earlier during host reconnect#4003

Open
MatheMatrix wants to merge 2 commits into
5.5.22from
sync/ya.wang/ZSTAC-69133@@2
Open

<fix>[ha]: setup self-fencer earlier during host reconnect#4003
MatheMatrix wants to merge 2 commits into
5.5.22from
sync/ya.wang/ZSTAC-69133@@2

Conversation

@MatheMatrix
Copy link
Copy Markdown
Owner

Summary: add HostBase after-connect hook, propagate one-time noStatusCheck, and fix CheckHostCapacityMsg timeout setup. Validation: root install, premium package, premium test-compile passed.

sync from gitlab !9896

This change is necessary because HA self-fencer setup needs to run
soon after the hypervisor connect hook completes during host reconnect.
This commit adds a host-level after-connect-hook extension point and
marks reconnect-generated connect messages so HA can distinguish real
host reconnects without changing existing connect single-flight behavior.
It also propagates the one-time status-check bypass to KVM self-fencer
setup paths and fixes host capacity check timeout setup.

DBImpact

Resolves: ZSTAC-69133

Change-Id: I7a73746163693639313333636f6d70757465

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 17, 2026

Review Change Stack

Warning

Rate limit exceeded

@MatheMatrix has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 2 minutes and 37 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)

Review profile: CHILL

Plan: Pro

Run ID: 124ed4d0-f420-4713-ba72-75591ffb8eb4

📥 Commits

Reviewing files that changed from the base of the PR and between 70a7aa4 and 9d2004f.

📒 Files selected for processing (4)
  • compute/src/main/java/org/zstack/compute/host/HostAfterConnectHookExtensionPoint.java
  • compute/src/main/java/org/zstack/compute/host/HostManagerImpl.java
  • header/src/main/java/org/zstack/header/host/HostConstant.java
  • plugin/kvm/src/main/java/org/zstack/kvm/KVMHostCapacityExtension.java

Walkthrough

此 PR 为主机连接流程添加 after-connect hook 扩展点并在连接 Flow 中收集/执行/回滚这些 hook;在主机容量检查消息上显式设置 30 分钟超时;并为 KVM 自定义围栏参数添加 noStatusCheck 以控制状态检查行为。

Changes

主机连接流程与扩展

Layer / File(s) Summary
After-Connect Hook 扩展点定义
compute/src/main/java/org/zstack/compute/host/HostAfterConnectHookExtensionPoint.java
新增 HostAfterConnectHookExtensionPoint 接口,定义 createAfterConnectHookFlow(HostInventory host, ConnectHostInfo info, boolean reconnect) 方法,返回 Flow 实例。
主机连接流程与钩子执行
compute/src/main/java/org/zstack/compute/host/HostBase.java
新增 AFTER_CONNECT_HOOK_RECONNECT 常量;在重连路径将该标记写入 ConnectHostMsg header;在 connect 流程中新增 Flow:遍历所有 HostAfterConnectHookExtensionPoint、收集其返回的 hook flows、顺序执行并记录已运行状态以便在 rollback 阶段对已执行的 hook 逆序回滚;hook 执行失败则直接 fail。
消息超时设置(容量检查)
compute/src/main/java/org/zstack/compute/host/HostManagerImpl.java, plugin/kvm/src/main/java/org/zstack/kvm/KVMHostCapacityExtension.java
在创建 CheckHostCapacityMsg 后、发送前显式调用 msg.setTimeout(...) 将消息超时设置为 30 分钟(TimeUnit.MINUTES.toMillis(30) 在 KVM 扩展中引入)。
自定义围栏参数与状态检查控制
plugin/kvm/src/main/java/org/zstack/kvm/KvmSetupSelfFencerExtensionPoint.java, plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageBase.java
KvmSetupSelfFencerParam 中新增布尔字段 noStatusCheck 及其 isNoStatusCheck() / setNoStatusCheck(boolean),并在 CephPrimaryStorageBase 的 self-fencer 调用处将该参数传入以控制是否跳过主机状态检查。

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 连接完毕小钩起,
超时稳当守时机,
围栏参数随心意,
钩行成链有回退,
春风拂过不惊奇。

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed 标题完全相关,清晰总结了主要变更:在主机重连期间更早地设置自篱笆,符合指定格式和字符限制。
Description check ✅ Passed 描述相关且信息量充足,涵盖了主要变更内容(after-connect hook、noStatusCheck传播、CheckHostCapacityMsg超时修复)。
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch sync/ya.wang/ZSTAC-69133@@2

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
compute/src/main/java/org/zstack/compute/host/HostAfterConnectHookExtensionPoint.java (1)

8-8: ⚡ Quick win

请为扩展点方法补充 Javadoc 契约说明。

这个接口方法会被插件实现并跨模块调用,建议补齐参数语义(尤其是 reconnect)和返回 Flow 的约束,避免实现方理解偏差。

♻️ 建议修改
 public interface HostAfterConnectHookExtensionPoint {
+    /**
+     * Create a hook flow that will be executed right after host connect.
+     *
+     * `@param` host connected host inventory snapshot
+     * `@param` info connect context
+     * `@param` reconnect true when this connect is triggered by reconnect flow
+     * `@return` hook flow, or null if no hook is needed
+     */
     Flow createAfterConnectHookFlow(HostInventory host, ConnectHostInfo info, boolean reconnect);
 }

As per coding guidelines “接口方法不应有多余的修饰符(例如 public),且必须配有有效的 Javadoc 注释。”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@compute/src/main/java/org/zstack/compute/host/HostAfterConnectHookExtensionPoint.java`
at line 8, 为接口方法 createAfterConnectHookFlow(HostInventory host, ConnectHostInfo
info, boolean reconnect) 补充规范的 Javadoc,删除任何多余的修饰符并在注释中明确说明:参数 host 和 info
的语义,reconnect 表示这是一个重连(true)还是首次连接(false)的场景;对返回值 Flow
的契约进行说明(例如执行时序、异常处理/回滚约定、是否可为 null 或必须返回非空 Flow、调用者如何使用返回的
Flow),以及线程/异步执行注意事项,并在注释中举例或提示实现者应遵循的行为准则,以避免跨模块实现歧义;保留方法签名不变,仅补充/替换为完整 Javadoc
注释。
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In
`@compute/src/main/java/org/zstack/compute/host/HostAfterConnectHookExtensionPoint.java`:
- Line 8: 为接口方法 createAfterConnectHookFlow(HostInventory host, ConnectHostInfo
info, boolean reconnect) 补充规范的 Javadoc,删除任何多余的修饰符并在注释中明确说明:参数 host 和 info
的语义,reconnect 表示这是一个重连(true)还是首次连接(false)的场景;对返回值 Flow
的契约进行说明(例如执行时序、异常处理/回滚约定、是否可为 null 或必须返回非空 Flow、调用者如何使用返回的
Flow),以及线程/异步执行注意事项,并在注释中举例或提示实现者应遵循的行为准则,以避免跨模块实现歧义;保留方法签名不变,仅补充/替换为完整 Javadoc
注释。

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)

Review profile: CHILL

Plan: Pro

Run ID: ea14a67b-1783-4743-b226-4489ac47d821

📥 Commits

Reviewing files that changed from the base of the PR and between 27513a5 and 50c796f.

📒 Files selected for processing (6)
  • compute/src/main/java/org/zstack/compute/host/HostAfterConnectHookExtensionPoint.java
  • compute/src/main/java/org/zstack/compute/host/HostBase.java
  • compute/src/main/java/org/zstack/compute/host/HostManagerImpl.java
  • plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageBase.java
  • plugin/kvm/src/main/java/org/zstack/kvm/KVMHostCapacityExtension.java
  • plugin/kvm/src/main/java/org/zstack/kvm/KvmSetupSelfFencerExtensionPoint.java

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
compute/src/main/java/org/zstack/compute/host/HostManagerImpl.java (1)

821-821: ⚡ Quick win

建议把容量检查超时提取为统一常量,避免跨模块硬编码漂移。

Line 821 与 KVM 扩展里都写死了 30 分钟,后续改超时时容易出现不一致。建议提取为共享常量(或统一走已有全局配置)后复用。

可参考的最小改动
 public class HostManagerImpl extends AbstractService implements HostManager, ManagementNodeChangeListener,
         ManagementNodeReadyExtensionPoint, FindSameNodeExtensionPoint {
+    private static final long CHECK_HOST_CAPACITY_TIMEOUT_MILLIS = TimeUnit.MINUTES.toMillis(30);
@@
-            msg.setTimeout(TimeUnit.MINUTES.toMillis(30));
+            msg.setTimeout(CHECK_HOST_CAPACITY_TIMEOUT_MILLIS);

As per coding guidelines "避免使用魔法值(Magic Value)".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@compute/src/main/java/org/zstack/compute/host/HostManagerImpl.java` at line
821, Hardcoded 30-minute timeout used via
msg.setTimeout(TimeUnit.MINUTES.toMillis(30)) in HostManagerImpl (and the KVM
extension) should be replaced with a shared constant: introduce a single
constant (e.g., CAPACITY_CHECK_TIMEOUT_MS or CAPACITY_CHECK_TIMEOUT_MINUTES) in
a common config/const class (or reuse an existing global timeout config) and
reference that constant from HostManagerImpl.msg.setTimeout(...) and the KVM
extension so both modules use the same value and avoid magic numbers.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@compute/src/main/java/org/zstack/compute/host/HostManagerImpl.java`:
- Line 821: Hardcoded 30-minute timeout used via
msg.setTimeout(TimeUnit.MINUTES.toMillis(30)) in HostManagerImpl (and the KVM
extension) should be replaced with a shared constant: introduce a single
constant (e.g., CAPACITY_CHECK_TIMEOUT_MS or CAPACITY_CHECK_TIMEOUT_MINUTES) in
a common config/const class (or reuse an existing global timeout config) and
reference that constant from HostManagerImpl.msg.setTimeout(...) and the KVM
extension so both modules use the same value and avoid magic numbers.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: http://open.zstack.ai:20001/code-reviews/zstack-cloud.yaml (via .coderabbit.yaml)

Review profile: CHILL

Plan: Pro

Run ID: 570509f2-8e45-441f-a95c-35663c0dac3c

📥 Commits

Reviewing files that changed from the base of the PR and between 50c796f and 70a7aa4.

📒 Files selected for processing (2)
  • compute/src/main/java/org/zstack/compute/host/HostManagerImpl.java
  • plugin/kvm/src/main/java/org/zstack/kvm/KVMHostCapacityExtension.java

@MatheMatrix
Copy link
Copy Markdown
Owner Author

Comment from yaohua.wu:

Review: MR !9896 — ZSTAC-69133

背景

ZSTAC-69133(P0,5.5.22 必须解):HA self-fencer 在 host 重连后被触发得太晚(要等整条 reconnect FlowChain 跑完、host 切到 Connected 才在 afterHostConnected 里 setup),造成心跳真空窗口与 HA 误杀风险。本 MR(zstack 侧)新增 HostAfterConnectHookExtensionPoint(执行点在 hypervisor connectHook() 之后、call-pre-connect-extensions 之前),重连时通过 header afterConnectHookReconnect 传递重连语义,并补齐两处 CheckHostCapacityMsg 的超时设置。整体方向与 Jira 修复方案一致,新增 flow 的 wrapper / 逆序 rollback 实现严谨,premium 侧配套了完整集成用例。

P0 — Critical

无。

🟡 Warning

W1. ConnectHostMsg single-flight signature 未加入 reconnect 维度(fix 完整性)
Jira 修复方案 item 3 明确要求"ConnectHostMsg single-flight signature 增加 reconnect 维度,避免普通 connect 与 reconnect 同 host 合并后丢失 reconnect 语义",并把关键代码位置标注为 HostBase.java:1238。但本 MR 的 HostBase.java 仅含三处改动(常量 AFTER_CONNECT_HOOK_RECONNECTputHeaderEntry、新增 flow),未见 1238 附近 single-flight signature 的变更。

  • 重连标记写在 ConnectHostMsg 的 header 里。若 connect 的 single-flight 为"合并"语义(同 host 并发 connect 合并为一次执行),并发的普通 connect(无 header)与 reconnect(有 header)合并后 reconnect marker 可能丢失 → createAfterConnectHookFlow 收到 reconnect=false → HA 跳过早期 setup → 退化为旧的晚期 afterHostConnected setup,即本 P0 在该竞态下不生效。
  • 请确认该 signature 改动是有意省略(如 connect single-flight 实为串行而非合并、各 msg 独立保留 header)还是遗漏。若为合并语义,建议按 Jira item 3 给 signature 增加 reconnect 维度。

🟢 Suggestion

S1. 30 分钟超时为重复魔法值HostManagerImpl.java:821KVMHostCapacityExtension.reportCapacity() 均硬编码 TimeUnit.MINUTES.toMillis(30)。建议抽取为共享常量(如 HostConstant 中的 CHECK_HOST_CAPACITY_TIMEOUT),避免后续两处漂移。

S2. 新增 SPI 缺少 JavadocHostAfterConnectHookExtensionPoint.createAfterConnectHookFlow 是跨模块扩展点,建议补充 Javadoc 说明 reconnect 语义、返回 null 的含义、返回 Flow 的回滚/异步约定,避免实现方理解偏差(项目编码规范要求接口方法配 Javadoc)。

🟢 advisory

目标分支 5.5.22 已较分支点推进 29 个 commit,HostBase.java 为高频改动文件。GitLab 报告 can_be_merged 无冲突,建议合并前 rebase 并重跑 reconnect 相关用例确认无语义冲突。

关联 MR

本次为跨仓联合修复(源分支 ZSTAC-69133@@2),需与 premium MR 一并合入:

  • premium !13938 — premium 侧将 HA 接入新扩展点(HaHostExtension implements HostAfterConnectHookExtensionPoint)、按 PrimaryStorageHostRefVO 连接状态过滤待 setup 的 PS、区分 early/late 跳过状态避免重复 setup,并新增 HaReconnectSelfFencerCase 集成用例。
  • 跨仓依赖:premium HaHostExtension 直接 implements org.zstack.compute.host.HostAfterConnectHookExtensionPoint(本 MR 新增的接口),ha.xml 注册该扩展点 —— 两个 MR 必须同时合入,premium 单独合入会编译失败。
  • 跨仓一致性:本 MR 在 KvmSetupSelfFencerParam 新增 noStatusCheck 字段,premium SelfFencerKvmBackend 负责把 SelfFencerStruct.noStatusCheck 透传到 KvmSetupSelfFencerParam 与 KVM 消息;本 MR 的 CephPrimaryStorageBase / SharedBlockGroupPrimaryStorageBase 通过 new KvmCommandSender(hostUuid, noStatusCheck) 消费,接口对齐一致。

Verdict: REVISION_REQUIRED

无 Critical。W1 需作者确认 single-flight signature 是否有意省略 —— 这关系到 P0 修复在并发场景下是否真正生效,作为 P0 bugfix 建议在合并前澄清。S1/S2/advisory 不阻塞。


🤖 Robot Reviewer

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@MatheMatrix MatheMatrix force-pushed the sync/ya.wang/ZSTAC-69133@@2 branch from 70a7aa4 to 9d2004f Compare May 18, 2026 06:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants